The similarity between the Fast Fourier Transform and the Viterbi algorithm is
exploited to develop a Concurrent Viterbi Algorithm suitable for a multiprocessor
system interconnected as a hypercube. The proposed algorithm can efficiently decode
large constraint length convolutional codes, using different degrees of parallelism,
and is attractive for VLSI implementation.


I. Introduction 

Concurrent computers have the potential to obtain large
increases in computational power, provided that one can find a
concurrent decomposition of a given algorithm.

We consider the Viterbi Algorithm for decoding $(m, 1/n_0)$
convolutional codes, where $m$ is the memory (constraint
length $= m + 1$) and $1/n_0$ is the code rate, and we describe an
efficient decomposition of this problem, the Concurrent
Viterbi Algorithm (CVA), which is suitable for multiprocessor
systems with a Hypercube architecture.

There are two key requirements in the problem decomposi- 
tion: 

(1) Divide the problem into equal parts, so that the
execution time available in each processor is shared equally.

(2) Minimize the communication between the parts, so 
that each processor needs to share information only 
with its neighbors. 

Fox (Ref. 1) has shown that the Hypercube is a natural
topology for the binary Fast Fourier Transform (FFT), and
high efficiency can be obtained with this structure. We will
show that there is a simple correspondence between the
connectivity of the FFT algorithm and the trellis diagram of the
Viterbi Algorithm. Therefore, efficient methods for implementing
the Viterbi Algorithm on a Hypercube computer can be developed.

II. Hypercube Computer Structure 

In general, it is desirable for an interconnection structure
to have a low number of links per node (low degree of a node),
and a small internode distance. This distance $d_{xy}$ between any
two nodes $x$ and $y$ is defined as the minimum number of links
required to send a message from $x$ to $y$. The diameter $D$ of a
network with $N$ nodes is defined as $D = \max\{d_{xy} \mid 0 \le x, y < N\}$.

A Boolean $n$-cube computer, or just Hypercube (Ref. 2), is
a network of $N = 2^n$ processors placed at the vertices of an
$n$-cube, and connected by its edges. Such a structure is attractive
because both the degree of a node and the diameter, which
are equal to $n$, grow only linearly with the dimensionality of
the cube.
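
As an illustrative check of these two properties, the following Python sketch builds an n-cube by flipping one label bit per link and measures distances by breadth-first search, using the definition of distance as the minimum number of links. The function names are hypothetical, and the check is only a sanity test for small cubes, not a proof.

```python
from collections import deque


def hypercube_neighbors(x: int, n: int) -> list[int]:
    """Nodes adjacent to x in an n-cube: flip each of the n label bits in turn."""
    return [x ^ (1 << k) for k in range(n)]


def distances_from(src: int, n: int) -> list[int]:
    """Breadth-first search: minimum number of links from src to every node."""
    dist = [-1] * (1 << n)
    dist[src] = 0
    queue = deque([src])
    while queue:
        x = queue.popleft()
        for y in hypercube_neighbors(x, n):
            if dist[y] < 0:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def degree_and_diameter(n: int) -> tuple[int, int]:
    """Largest node degree and largest internode distance of the n-cube."""
    nodes = range(1 << n)
    degree = max(len(set(hypercube_neighbors(x, n))) for x in nodes)
    diameter = max(max(distances_from(x, n)) for x in nodes)
    return degree, diameter
```

For small cubes, `degree_and_diameter(n)` can be compared against the pair (n, n) stated in the text.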

In a multiprocessor network there is always a trade-off
between the degree of a node and the diameter, the extreme
case being the completely connected network, which has
$N - 1$ links per node and $D = 1$. In general, a low degree of a
node implies a large diameter.

The Hypercube structure seems a reasonable compromise
for practical multiprocessor systems, except for the disadvantage
of offering limited input/output capability. This negative
aspect, which is less important for computationally intensive
(vs. I/O intensive) algorithms, is furthermore mitigated by the
small diameter of the network, and the simplicity of efficient
broadcasting methods.

Two processors $x$ and $y$ will be called neighbors if their
binary labels differ only in one position, i.e.,

$$x = [x_{n-1}, \ldots, x_k, \ldots, x_0], \qquad y = [x_{n-1}, \ldots, \bar{x}_k, \ldots, x_0],$$

where $x_k \in \{0, 1\}$ and $\bar{x}_k$ is the complement of $x_k$.
In this context, the distance $d_{xy}$ between two processors is just
the Hamming distance of their binary labels, and neighbors have
$d_{xy} = 1$. In the Hypercube, the number of nodes at distance $d$
from a given node (node $[0, 0, \ldots, 0]$ can be considered without
loss of generality) is $N_d = \binom{n}{d}$, so that the average
distance is

$$\bar{d} = \frac{1}{N} \sum_{d=0}^{n} d \binom{n}{d} = \frac{n}{2}.$$

III. Equivalence of Networks

We now define two basic network structures: the $m$-stage
Viterbi algorithm trellis (de Bruijn graph), shown in Fig. 1 for
$M = 2^m = 8$ states, and the FFT decimation-in-time graph of
Fig. 2.

Let $x = [x_{m-1}, \ldots, x_0]$ be the binary index of a state or
node of the graphs, and $u = [u_0, \ldots, u_k, \ldots, u_{m-1}]$,
$u_k \in \{0, 1\}$, be a sequence of input bits which define one of the
two possible paths out of each node, where $k$ represents the
stage.

First, consider the elementary transformation $\sigma(x, u_k)$ which
describes the state transitions at stage $k$, and is defined as

$$\sigma(x, u_k) = \sigma([x_{m-1}, \ldots, x_0], u_k) \triangleq [u_k, x_{m-1}, \ldots, x_1] \qquad (1)$$

This is a cyclic right shift of $x$, with $x_0$ replaced by $u_k$. Then
the transformation $\sigma(\cdot, \cdot)$ completely describes the transitions
of the graph in Fig. 1, and the $m$-stage transformation from a
state $x$ at stage 0 to state $y$ at stage $m$ is defined as

$$y = \sigma^{(m)}(x, u) \triangleq \sigma(\sigma(\ldots \sigma(\sigma(x, u_0), u_1) \ldots), u_{m-1}) \qquad (2)$$

The elementary transformation needed to describe Fig. 2 is
given by

$$\omega(x, u_k) \triangleq [x_{m-1}, \ldots, x_{k+1}, u_k, x_{k-1}, \ldots, x_0] \qquad (3)$$

which replaces $x_k$ by $u_k$. The complete $m$-stage graph of Fig. 2
is then described by the transformation $\omega^{(m)}(x, u)$,

$$y = \omega^{(m)}(x, u) \triangleq \omega(\omega(\ldots \omega(\omega(x, u_0), u_1) \ldots), u_{m-1}) \qquad (4)$$

Now, it is easy to verify that, at the $m$th stage,

$$\sigma^{(m)}(x, u) = \omega^{(m)}(x, u)$$

Therefore, the two networks lead from a given state $x$ to the
same state $y$, after $m$ stages.

The equivalence of the two networks can be further
expressed, at any stage, by defining the cyclic right shift
operator

$$\rho^{(k)}(x) \triangleq [x_{k-1}, \ldots, x_0, x_{m-1}, \ldots, x_k]$$

where $\rho^{(m)}(x) = \rho^{(0)}(x) = x$, and verifying that

$$\sigma^{(k)}(x, u) = \rho^{(k)}(\omega^{(k)}(x, u)) \qquad (5)$$

This result shows that, if we relabel each node $x$ of the graph
in Fig. 1 with the label $x' = \rho^{(k)}(x)$ at stage $k$, the two networks
are functionally and topologically equivalent; that is, they are
just two different ways of drawing the same network. A given
path generated by an input sequence $u$ visits the same nodes in
Fig. 1 and Fig. 2, if we relabel each node $x$ in Fig. 1 by $\rho^{(k)}(x)$.

Having established this formalism on networks, we can now
apply the above results to the study of algorithm structures, in
particular to the Viterbi algorithm and its relationship with the
radix-2 FFT.
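
The relation between the trellis and FFT transformations in Eq. (5) can be checked exhaustively for small m. The Python sketch below is one possible encoding, assuming that a label [x_{m-1}, ..., x_0] is stored as an integer with x_0 as the least significant bit, and that the k-stage transformations use the inputs u_0, ..., u_{k-1} in order. All function names are hypothetical.

```python
from itertools import product


def sigma(x: int, u: int, m: int) -> int:
    """Trellis step: shift x right by one bit and insert u as the new top bit."""
    return (u << (m - 1)) | (x >> 1)


def omega(x: int, u: int, k: int) -> int:
    """FFT-graph step at stage k: replace bit k of x by u."""
    return (x & ~(1 << k)) | (u << k)


def rho(x: int, k: int, m: int) -> int:
    """Cyclic right shift of the m-bit label x by k positions."""
    k %= m
    mask = (1 << m) - 1
    return ((x >> k) | (x << (m - k))) & mask


def sigma_k(x: int, u: tuple, k: int, m: int) -> int:
    """Apply k trellis steps with inputs u[0], ..., u[k-1]."""
    for j in range(k):
        x = sigma(x, u[j], m)
    return x


def omega_k(x: int, u: tuple, k: int) -> int:
    """Apply k FFT-graph steps, stage j replacing bit j by u[j]."""
    for j in range(k):
        x = omega(x, u[j], j)
    return x


def check_equivalence(m: int) -> bool:
    """True if sigma^(k) = rho^(k)(omega^(k)) for all states, inputs, and k <= m."""
    for x in range(1 << m):
        for u in product((0, 1), repeat=m):
            for k in range(m + 1):
                if sigma_k(x, u, k, m) != rho(omega_k(x, u, k), k, m):
                    return False
    return True
```

Calling `check_equivalence(m)` for a few small values of m exercises every state and every input sequence; the case k = m covers the statement that both networks reach the same final state.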

IV. Concurrent Viterbi Algorithm 

Consider a multiprocessor system with N processors located 
at the vertices of an n-cube and linked only by the edges of the 
cube (see an example for n = 3 in Fig. 3(a)). Note that the
nodes are labeled by an n-bit binary number, so that the ith bit 
is the coordinate of a node along the ith dimension. 

If in Fig. 2 we collapse the horizontal dimension, we obtain 
a graph which is exactly identical to that of Fig. 3, i.e., with 
the same connections between nodes. This observation, as 
explained in Refs. 1 and 3, suggests a natural way to imple- 
ment the FFT on a Hypercube computer. 

The implementation of the FFT on the Hypercube can be
stated more formally if we define the Hypercube ($m$-cube)
network by the transformation

$$\gamma_k([x_{m-1}, \ldots, x_k, \ldots, x_0]) = [x_{m-1}, \ldots, \bar{x}_k, \ldots, x_0] \qquad (6)$$

where $\bar{x}_k$ is the complement of $x_k$, and observe that the
transformation in Eq. (3) can always be obtained by Eq. (6), since
$u_k \in \{0, 1\}$ and $u_k = x_k$ requires no communication
(self-loop).

To perform the first stage of Fig. 2, let the nodes of Fig. 3
communicate along the first dimension, and so on for each 
stage and dimension. In this way, the links provided by the 
n-cube are just those necessary to perform the FFT. This 
implementation on the n-cube is possible since the FFT 
requires communication only between neighboring nodes of 
the cube. 

At first, the network of Fig. 1 might seem to require
communication between distant nodes on the cube. But this problem
can be easily overcome if we relabel the nodes as discussed
in Sec. III. Specifically, at stage $k$, processor $x$ will represent
state $\rho^{(k)}(x)$ of the Viterbi trellis. Processor transitions are
described by the graph of Fig. 2, while state transitions are
given by Fig. 1, as desired. Therefore a Viterbi decoder can be
efficiently implemented on the Hypercube, exactly as for the
FFT. The similarity between the FFT and Viterbi trellis was
previously pointed out by Forney (Ref. 3), but apparently not
exploited in any practical way.
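
The claim that the relabeling keeps all communication between neighbors can be tested directly. In the Python sketch below, processor z is assumed to hold the state obtained by cyclically right-shifting its label by k positions at stage k, and stages beyond m are assumed to reuse dimension k mod m. The sketch checks that the two trellis predecessors of the state a processor holds at stage k + 1 are held, at stage k, by that processor and by its neighbor along one dimension. Function names are hypothetical.

```python
def rho(x: int, k: int, m: int) -> int:
    """Cyclic right shift of the m-bit label x by k positions."""
    k %= m
    mask = (1 << m) - 1
    return ((x >> k) | (x << (m - k))) & mask


def trellis_predecessors(s: int, m: int) -> set[int]:
    """The two states that a right-shift trellis step can map into state s."""
    low = (s << 1) & ((1 << m) - 1)
    return {low, low | 1}


def holders_of_predecessors(z: int, k: int, m: int) -> set[int]:
    """Processors that hold, at stage k, the predecessors of the state
    that processor z holds at stage k + 1."""
    preds = trellis_predecessors(rho(z, k + 1, m), m)
    return {w for w in range(1 << m) if rho(w, k, m) in preds}


def communication_is_local(m: int, stages: int) -> bool:
    """True if every stage needs only the processor itself and one neighbor."""
    for k in range(stages):
        dim = k % m  # assumed dimension used at stage k
        for z in range(1 << m):
            if holders_of_predecessors(z, k, m) != {z, z ^ (1 << dim)}:
                return False
    return True
```

For example, `communication_is_local(3, 15)` runs the check over five passes through a 3-cube, matching a decoder that uses about 5m stages.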

Each processor, at stage $k$, receives the accumulated metric
and survivor of its neighbor along dimension $k$ of the cube,
and performs the usual comparison and update. In a practical
Viterbi decoder for a $(m, 1/n_0)$ code, the number of stages
required to obtain a performance very close to optimum is
approximately $5m$. Notice that, when the stage $k$ is a multiple
of $m$, the state and processor labels are identical, since
$\rho^{(m)}(x) = x$, so that we may easily select the decoded bit
belonging to the most likely survivor. However, in order to
minimize internode communications, it may be more advantageous
to increase slightly the number of stages and read
the decoded bit at node zero, which simplifies I/O operations
(see Sec. V).


V. Message Broadcasting 

Performing the CVA requires that blocks of data be loaded 
in every processor: This operation is called broadcasting. In 
the Hypercube (Ref. 2), data from the host processor is directly 
exchanged only through node zero (the origin of the cube). 
Therefore, an efficient concurrent method is required to broad- 
cast a message from node zero to all other nodes. Since the 
diameter of an n-cube is n, a lower bound on the broadcasting 
time is n time units (where the unit is the time to send a mes- 
sage to a neighbor). 

Assume that message $A$ is in node zero, at time zero. In
each subsequent time slot $k$, send messages in parallel from
each node $x = [x_{n-1}, \ldots, x_{k+1}, 0, x_{k-1}, \ldots, x_0]$ to each
node $\gamma_k(x)$, the neighbors along dimension $k$. After $n$ time
units, message $A$ has propagated to all nodes.

Even though this method does not minimize the number of 
communications (with the advantage of a very simple index- 
ing), it optimizes the total broadcasting time to n time units. 
The result is clearly optimum, since it achieves the lower 
bound. 
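
The broadcasting scheme can be simulated in a few lines. The Python sketch below assumes the time slots are indexed k = 0, ..., n - 1 and that a transfer from a node that does not yet hold the message delivers nothing useful; the function name and the returned history are illustrative choices, not part of the original description.

```python
def broadcast(n: int) -> list[set[int]]:
    """Simulate the broadcast on an n-cube.

    Returns the set of nodes holding the message at time zero and
    after each of the n time slots.
    """
    informed = {0}  # the message starts in node zero
    history = [set(informed)]
    for k in range(n):
        # Every node whose bit k is 0 sends along dimension k; only
        # senders that already hold the message pass it on.
        newly = {
            x | (1 << k)
            for x in range(1 << n)
            if not (x >> k) & 1 and x in informed
        }
        informed |= newly
        history.append(set(informed))
    return history
```

In this simulation the number of informed nodes doubles in every time slot, so all 2^n nodes hold the message after n slots, which is the lower bound set by the diameter.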


VI. Decoders for Large Constraint Length 

Existing Hypercube computers have up to 128 nodes
(n = 7), and will soon be extended to 1,024 nodes (n = 10).
Yet, in order to decode powerful convolutional codes with 
m > 10, one needs to obtain algorithms which assign more 
than one state per processor. This need is dictated not only by 
the practical limitations on the physical size of the computer, 
but also by the goal of achieving high computational efficiency. 

The efficiency $\eta$ of a parallel computer is defined as

$$\eta \triangleq \frac{N_0 t_0}{N_0 t_0 + N_t t_t} = \frac{\text{sequential algorithm time}}{N \times (\text{parallel algorithm time})}$$

and the speed-up is given by $\eta N$, where $N$ is the number
of processors, $N_0$ is the number of parallel operations
(butterflies), $t_0$ is the time required by each operation, $N_t$ is
the number of parallel data transfers, and $t_t$ is the
communication time.

When the number of states $M$ of the decoder is larger than
the available number of processors $N$, states can be grouped in
sets of $S = 2^s$ states per processor, where $M = SN$. To see how
this is possible for $S = 2$ in the proposed CVA implementation,
consider Fig. 2 and group each pair of nodes into one processor
$P_{\lfloor i/2 \rfloor}$, for nodes $i$, $i = 0, \ldots, 7$, where $\lfloor j \rfloor$ is the largest
integer less than or equal to $j$. Similarly, with two processors
$P_{\lfloor i/4 \rfloor}$, $i = 0, \ldots, 7$, we obtain the case $S = 4$. The extreme
cases, $S = M$ and $S = 1$, represent the completely sequential and
completely parallel decoder, respectively. Intermediate cases
represent different degrees of parallelism or a different
granularity of the algorithm, which is defined as the amount of
computation between successive data transfers. In general, the
efficiency increases with the granularity.

During $m$ stages of the CVA, the number of parallel data
transfers is

$$N_t = S(m - s) = Sn \qquad (7)$$

since only $m - s$ stages of Fig. 2 need to communicate with
neighbors ($s$ stages do only internal computations). The
number of parallel operations (butterflies) $N_0$ is given by

$$N_0 = \ldots \qquad (8)$$

As an example, $N_t$ and $N_0$ are plotted in Fig. 4 for $M = 64$.
From Eqs. (7) and (8) we have that the efficiency $\eta$ decreases
as $S$ decreases from $M$ to 1, and, for $N = M$, $\eta$ is equal to the
ratio $t_0/(t_0 + 2t_t)$, which depends on the hardware
implementation.

The CVA with $S > 1$ is useful because it allows the
implementation of complex decoders, avoiding the pin-limitation
constraint problem encountered in existing VLSI decoders.
This problem is due to the particular partitioning of the
traditional decoder into a branch metric computation block and a
path memory storage block. This partition requires a rapidly
increasing number of pins in the VLSI chips. The proposed
CVA has instead a number of connections per node increasing
only linearly with the dimensionality of the cube, and is
therefore more suitable for VLSI implementation.

Furthermore, the CVA has been extended to high rate
$(m, k_0/n_0)$ codes with $k_0 > 1$, where $k_0$ is the number of
input bits in the encoder. This extension is possible only for
a limited range of $m$, $k_0$ and $S$, the number of states per
processor. An example for $(3, 2/n_0)$ codes is given in Fig. 5,
where $S = 2$.


Given a rate $1/n_0$ code it is known how to generate all the
punctured codes of rate $k_0/n_0$, $k_0 > 1$. Since these codes
involve only pairwise comparisons at each stage, it is certainly
possible to decode them with the CVA. It must be noted, however,
that punctured codes require more stages, and this is
equivalent to linking nodes of the Hypercube which are not
neighbors, by using multiple stages.
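
The dependence of efficiency on the number of states per processor can be explored numerically. The Python sketch below is built on two assumptions that are not taken verbatim from the text: the efficiency is computed as the fraction of per-processor time spent on butterflies, N_0 t_0 / (N_0 t_0 + N_t t_t), and the butterfly count is set to N_0 = mS/2. The second value is chosen only because it reproduces the stated limit t_0/(t_0 + 2 t_t) for one state per processor and gives an efficiency of 1 in the fully sequential case. The function names are hypothetical.

```python
def processor_of(i: int, s: int) -> int:
    """Processor index floor(i / 2**s) that holds node i when S = 2**s."""
    return i >> s


def cva_efficiency(m: int, s: int, t0: float, tt: float) -> float:
    """Efficiency for memory m and S = 2**s states per processor.

    t0 is the time per butterfly, tt the time per data transfer.
    """
    S = 1 << s
    n_transfers = S * (m - s)  # Eq. (7): only m - s stages communicate
    n_ops = m * S / 2          # ASSUMPTION: stand-in for the missing Eq. (8)
    return n_ops * t0 / (n_ops * t0 + n_transfers * tt)
```

Under these assumptions the expression simplifies to m t_0 / (m t_0 + 2(m - s) t_t), which grows with s and therefore with the granularity, in line with the qualitative statement in the text.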


VII. Conclusion 

The proposed CVA decoder has been implemented and
successfully tested on a 64-node Hypercube computer for
$(m, 1/n_0)$ codes, with $m = 2, \ldots, 14$. Present results
confirm the usefulness of the CVA for large constraint length
codes.

Fig. 3. The 3-cube graph: (a) 1st stage, (b) 2nd stage, (c) 3rd stage